SuperMinHash - A New Minwise Hashing Algorithm for Jaccard Similarity Estimation
Abstract
This paper presents a new algorithm for calculating hash signatures of sets which can be directly used for Jaccard similarity estimation. The new approach is an improvement over the MinHash algorithm, because it has a better runtime behavior and the resulting signatures allow a more precise estimation of the Jaccard index.
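For orientation, the baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of classic MinHash only, not of SuperMinHash, whose details are not given here. The function names, the choice of BLAKE2b as the hash function, and the signature length `m` are assumptions made for this example.

```python
import hashlib


def _hash(element, seed):
    """Hypothetical helper: 64-bit hash of an element under a given seed."""
    data = f"{seed}:{element!r}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, m=128):
    """Classic MinHash: for each of m seeded hash functions, keep the minimum hash value."""
    items = set(items)
    if not items:
        raise ValueError("signature of an empty set is undefined")
    # Cost is proportional to m * len(items).
    return [min(_hash(x, seed) for x in items) for seed in range(m)]


def jaccard_estimate(sig_a, sig_b):
    """Estimate the Jaccard index as the fraction of matching signature components."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard_exact(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B|, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Two signatures of the same length can be compared component by component without access to the original sets. The estimate becomes more precise as `m` grows, at the price of more hash evaluations per element.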
Similar resources
b-Bit Minwise Hashing for Estimating Three-Way Similarities
Computing two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits (where b ≥ 2 for 3-way). The extension to 3-way similarity from the prior work on 2-way...
BagMinHash - Minwise Hashing Algorithm for Weighted Sets
Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time-consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than current state of the art without any particular rest...
Approximately Minwise Independence with Twisted Tabulation
A random hash function $h$ is $\varepsilon$-minwise if for any set $S$, $|S| = n$, and element $x \in S$, $\Pr[h(x) = \min h(S)] = (1 \pm \varepsilon)/n$. Minwise hash functions with low bias $\varepsilon$ have widespread applications within similarity estimation. Hashing from a universe $[u]$, the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA'13] makes $c = O(1)$ lookups in tables of size $u^{1/c}$. Twisted tabulation was invented to get good ...
b-Bit Minwise Hashing for Large-Scale Linear SVM
Linear Support Vector Machines (e.g., SVM, Pegasos, LIBLINEAR) are powerful and extremely efficient classification tools when the datasets are very large and/or high-dimensional, which is common in (e.g.,) text classification. Minwise hashing is a popular technique in the context of search for computing resemblance similarity between ultra high-dimensional (e.g., 2) data vectors such as document...
Optimal Densification for Fast and Accurate Minwise Hashing
Minwise hashing is a fundamental and one of the most successful hashing algorithms in the literature. Recent advances based on the idea of densification (Shrivastava & Li, 2014a;c) have shown that it is possible to compute k minwise hashes of a vector with d nonzeros in mere (d + k) computations, a significant improvement over the classical O(dk). These advances have led to an algorithmic impr...
Journal title:
- CoRR
Volume abs/1706.05698, issue -
Pages -
Publication date: 2017